On Airbnb, anyone who has a room or property of any type (apartment, house, cottage, inn, etc.) can offer it for rent on a daily basis.
The host creates a profile and a listing for the property.
In this listing, the host should describe the property's characteristics as comprehensively as possible to help renters/travelers choose the best property for them.
The listing can be customized in many ways, including minimum stay requirements, price, number of rooms, cancellation policies, extra fees for additional guests, identity verification requirements for renters, etc.
The goal is to build a price prediction model that lets ordinary property owners determine how much they should charge per night.
The model should also help ordinary renters judge whether a property they are considering is competitively priced (below the average for properties with similar characteristics) or not.
The datasets were obtained from the Kaggle website: https://www.kaggle.com/allanbruno/airbnb-rio-de-janeiro
The datasets contain property prices and their respective characteristics for each month. The prices are given in Brazilian Real (BRL). We have data from April 2018 to May 2020, with the exception of June 2018, which does not have a dataset.
I believe seasonality can be an important factor, as months like December tend to be quite expensive in Rio de Janeiro. The location of the property should make a significant difference in the price since in Rio de Janeiro, location can completely change the characteristics of a place (safety, natural beauty, tourist attractions). Additional amenities/facilities may have a significant impact, considering the presence of many old buildings and houses in Rio de Janeiro. We will discover how much these factors impact prices and if there are other less intuitive factors that are extremely important.
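One simple way to check the seasonality hypothesis is to compare the average price per month. The sketch below assumes a numeric `price` column and a `month` column; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values), standing in for the real listings.
toy = pd.DataFrame({
    'month': [1, 1, 7, 7, 12, 12],
    'price': [300.0, 340.0, 200.0, 220.0, 400.0, 420.0],
})

# Average nightly price per month; a large spread suggests seasonality.
monthly_mean = toy.groupby('month')['price'].mean()
print(monthly_mean)
```

On the real data, the same `groupby` would only be meaningful after the price column has been converted to a numeric type.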
import pandas as pd
import pathlib
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split
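months = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
database_path = pathlib.Path('dataset')
frames = []
for file in database_path.iterdir():
    month_name = file.name[:3]  # first three letters identify the month
    month = months[month_name]
    year = file.name[-8:]  # e.g. '2018.csv'
    year = int(year.replace('.csv', ''))
    # low_memory=False avoids the mixed-type DtypeWarning raised for a few columns
    df = pd.read_csv(database_path / file.name, low_memory=False)
    df['year'] = year
    df['month'] = month
    frames.append(df)
# DataFrame.append is deprecated (removed in pandas 2.0); concatenate once after the loop
base_airbnb = pd.concat(frames)
display(base_airbnb)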
| | id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | number_of_reviews_ltm | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14063 | https://www.airbnb.com/rooms/14063 | 20180414160018 | 2018-04-14 | Living in a Postcard | Besides the most iconic's view, our apartment ... | NaN | Besides the most iconic's view, our apartment ... | none | Best and favorite neighborhood of Rio. Perfect... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 17878 | https://www.airbnb.com/rooms/17878 | 20180414160018 | 2018-04-14 | Very Nice 2Br - Copacabana - WiFi | Please note that special rates apply for New Y... | - large balcony which looks out on pedestrian ... | Please note that special rates apply for New Y... | none | This is the best spot in Rio. Everything happe... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 24480 | https://www.airbnb.com/rooms/24480 | 20180414160018 | 2018-04-14 | Nice and cozy near Ipanema Beach | My studio is located in the best of Ipanema. ... | The studio is located at Vinicius de Moraes St... | My studio is located in the best of Ipanema. ... | none | The beach, the lagoon, Ipanema is a great loca... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 25026 | https://www.airbnb.com/rooms/25026 | 20180414160018 | 2018-04-14 | Beautiful Modern Decorated Studio in Copa | Our apartment is a little gem, everyone loves ... | This newly renovated studio (last renovations ... | Our apartment is a little gem, everyone loves ... | none | Copacabana is a lively neighborhood and the ap... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 31560 | https://www.airbnb.com/rooms/31560 | 20180414160018 | 2018-04-14 | NICE & COZY 1BDR - IPANEMA BEACH | This nice and clean 1 bedroom apartment is loc... | This nice and clean 1 bedroom apartment is loc... | This nice and clean 1 bedroom apartment is loc... | none | Die Nachbarschaft von Ipanema ist super lebend... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34324 | 38844730 | https://www.airbnb.com/rooms/38844730 | 20190923212307 | 2019-09-24 | TRANSAMERICA BARRA DA TIJUCA R IV | Em estilo contemporâneo, o Transamerica Prime ... | NaN | Em estilo contemporâneo, o Transamerica Prime ... | none | NaN | ... | 1.0 | 1.0 | 1125.0 | 1125.0 | 1.0 | 1125.0 | 0.0 | 15.0 | 0.0 | 0.0 |
| 34325 | 38846408 | https://www.airbnb.com/rooms/38846408 | 20190923212307 | 2019-09-24 | Alugo para o Rock in Rio | Confortável apartamento, 2 quartos , sendo 1 s... | O apartamento estará com mobília completa disp... | Confortável apartamento, 2 quartos , sendo 1 s... | none | Muito próximo ao Parque Olímpico, local do eve... | ... | 2.0 | 2.0 | 1125.0 | 1125.0 | 2.0 | 1125.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 34326 | 38846703 | https://www.airbnb.com/rooms/38846703 | 20190923212307 | 2019-09-24 | Apt COMPLETO em COPACABANA c/TOTAL SEGURANÇA | Apartamento quarto e sala COMPLETO para curtas... | Espaço ideal para até 5 pessoas. Cama de casal... | Apartamento quarto e sala COMPLETO para curtas... | none | NaN | ... | 3.0 | 3.0 | 1125.0 | 1125.0 | 3.0 | 1125.0 | 0.0 | 23.0 | 6.0 | 0.0 |
| 34327 | 38847050 | https://www.airbnb.com/rooms/38847050 | 20190923212307 | 2019-09-24 | Cobertura Cinematografica | Cobertura alto nivel | NaN | Cobertura alto nivel | none | NaN | ... | 1.0 | 1.0 | 1125.0 | 1125.0 | 1.0 | 1125.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 34328 | 38847655 | https://www.airbnb.com/rooms/38847655 | 20190923212307 | 2019-09-24 | Quarto em cobertura em frente à praia III | Quarto em cobertura quadriplex com vista lindí... | NaN | Quarto em cobertura quadriplex com vista lindí... | none | NaN | ... | 1.0 | 1.0 | 30.0 | 30.0 | 1.0 | 30.0 | 0.0 | 0.0 | 4.0 | 0.0 |
902210 rows × 108 columns
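The file-name parsing in the loading loop can be isolated into a small helper. This is a sketch that assumes every file follows a pattern such as `april2018.csv` (month name followed by a four-digit year), which is what the slicing in the loop implies; `parse_month_year` is a hypothetical name.

```python
from pathlib import Path

# Same month lookup as in the loading loop.
MONTHS = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

def parse_month_year(file_name):
    """Return (month, year) from a name such as 'april2018.csv' (assumed pattern)."""
    stem = Path(file_name).stem        # 'april2018'
    month = MONTHS[stem[:3].lower()]   # first three letters identify the month
    year = int(stem[-4:])              # last four characters are the year
    return month, year

print(parse_month_year('april2018.csv'))  # (4, 2018)
```

A file that does not follow the assumed pattern raises a `KeyError` or `ValueError` rather than being tagged with the wrong month or year.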
A quick analysis reveals that several columns are not necessary for our prediction model. Therefore, we will exclude some columns from our dataset.
Types of columns we will exclude:
- IDs, links, and information irrelevant to the model
- Repeated or extremely similar columns that provide the same information to the model (e.g., Date vs. Year/Month)
- Free-text columns, since we won't run any text analysis on them
- Columns where all or almost all values are the same

To do this, we will export the first 1,000 records to a CSV file (which can be opened in Excel) and perform a qualitative analysis by examining the columns and identifying which ones are unnecessary.
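The last criterion (columns where almost all values are the same) can also be checked programmatically. The following is a minimal sketch, not the method used in this notebook; the helper name `near_constant_columns` and the 95% threshold are assumptions chosen for illustration.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.95):
    """List columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for name in df.columns:
        # Shares are sorted from the most to the least frequent value.
        shares = df[name].value_counts(normalize=True, dropna=False)
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged.append(name)
    return flagged

# Toy frame: one constant column and one balanced column.
sample = pd.DataFrame({
    'experiences_offered': ['none'] * 20,
    'room_type': ['Entire home/apt', 'Private room'] * 10,
})
print(near_constant_columns(sample))  # ['experiences_offered']
```

A check like this only complements the manual review, since it cannot tell whether a column is irrelevant or redundant.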
print(list(base_airbnb.columns))
base_airbnb.head(1000).to_csv('first_registers.csv')
['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'space', 'description', 'experiences_offered', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'street', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet', 'price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped', 'number_of_reviews', 'first_review', 'last_review', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'requires_license', 'license', 'jurisdiction_names', 'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification', 'calculated_host_listings_count', 'reviews_per_month', 'year', 'month', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'number_of_reviews_ltm', 'calculated_host_listings_count_entire_homes', 
'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms']
columns = ['host_response_time','host_response_rate','host_is_superhost','host_listings_count','latitude','longitude','property_type','room_type','accommodates','bathrooms','bedrooms','beds','bed_type','amenities','price','security_deposit','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights','number_of_reviews','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value','instant_bookable','is_business_travel_ready','cancellation_policy','year','month']
base_airbnb = base_airbnb.loc[:, columns]
print(list(base_airbnb.columns))
display(base_airbnb)
['host_response_time', 'host_response_rate', 'host_is_superhost', 'host_listings_count', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'year', 'month']
| | host_response_time | host_response_rate | host_is_superhost | host_listings_count | latitude | longitude | property_type | room_type | accommodates | bathrooms | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | instant_bookable | is_business_travel_ready | cancellation_policy | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | f | 1.0 | -22.946854 | -43.182737 | Apartment | Entire home/apt | 4 | 1.0 | ... | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | f | f | strict_14_with_grace_period | 2018 | 4 |
| 1 | within an hour | 100% | t | 2.0 | -22.965919 | -43.178962 | Condominium | Entire home/apt | 5 | 1.0 | ... | 9.0 | 10.0 | 10.0 | 9.0 | 9.0 | t | f | strict | 2018 | 4 |
| 2 | within an hour | 100% | f | 1.0 | -22.985698 | -43.201935 | Apartment | Entire home/apt | 2 | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | f | f | strict | 2018 | 4 |
| 3 | within an hour | 100% | f | 3.0 | -22.977117 | -43.190454 | Apartment | Entire home/apt | 3 | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | f | f | strict | 2018 | 4 |
| 4 | within an hour | 100% | t | 1.0 | -22.983024 | -43.214270 | Apartment | Entire home/apt | 3 | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | t | f | strict | 2018 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34324 | within an hour | 93% | f | 0.0 | -23.003180 | -43.342840 | Apartment | Entire home/apt | 4 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | f | f | flexible | 2019 | 9 |
| 34325 | NaN | NaN | f | 0.0 | -22.966640 | -43.393450 | Apartment | Entire home/apt | 4 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | f | f | flexible | 2019 | 9 |
| 34326 | within a few hours | 74% | f | 32.0 | -22.962080 | -43.175520 | Apartment | Entire home/apt | 5 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | f | f | strict_14_with_grace_period | 2019 | 9 |
| 34327 | NaN | NaN | f | 0.0 | -23.003400 | -43.341820 | Apartment | Entire home/apt | 4 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | f | f | strict_14_with_grace_period | 2019 | 9 |
| 34328 | a few days or more | 38% | f | 5.0 | -23.010560 | -43.363350 | Apartment | Private room | 2 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | f | f | strict_14_with_grace_period | 2019 | 9 |
902210 rows × 34 columns
for column in base_airbnb:
    # Drop columns with more than 300,000 missing values
    if base_airbnb[column].isnull().sum() > 300000:
        base_airbnb = base_airbnb.drop(column, axis=1)
print(base_airbnb.isnull().sum())
host_is_superhost 460
host_listings_count 460
latitude 0
longitude 0
property_type 0
room_type 0
accommodates 0
bathrooms 1724
bedrooms 850
beds 2502
bed_type 0
amenities 0
price 0
guests_included 0
extra_people 0
minimum_nights 0
maximum_nights 0
number_of_reviews 0
instant_bookable 0
is_business_travel_ready 0
cancellation_policy 0
year 0
month 0
dtype: int64
base_airbnb = base_airbnb.dropna()
print(base_airbnb.shape)
print(base_airbnb.isnull().sum())
(897709, 23)
host_is_superhost 0
host_listings_count 0
latitude 0
longitude 0
property_type 0
room_type 0
accommodates 0
bathrooms 0
bedrooms 0
beds 0
bed_type 0
amenities 0
price 0
guests_included 0
extra_people 0
minimum_nights 0
maximum_nights 0
number_of_reviews 0
instant_bookable 0
is_business_travel_ready 0
cancellation_policy 0
year 0
month 0
dtype: int64
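The two-step treatment of missing values above (drop very sparse columns, then drop the few remaining incomplete rows) can be sketched on a toy frame. The helper name `drop_sparse_columns` and the toy values are assumptions for illustration; the notebook itself uses a threshold of 300,000 missing values.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing):
    """Keep only the columns with at most `max_missing` null values."""
    missing = df.isnull().sum()
    keep = missing[missing <= max_missing].index
    return df.loc[:, keep]

toy = pd.DataFrame({
    'bedrooms': [1.0, 2.0, np.nan, 1.0],
    'square_feet': [np.nan, np.nan, np.nan, 50.0],  # mostly empty
    'price': [100.0, 200.0, 150.0, 90.0],
})

# Step 1: drop the sparse column. Step 2: drop rows that still have a gap.
cleaned = drop_sparse_columns(toy, max_missing=2).dropna()
print(cleaned.shape)  # (3, 2)
```

Dropping sparse columns first matters: calling `dropna()` directly on the toy frame would keep only one row, because `square_feet` is missing almost everywhere.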
print(base_airbnb.dtypes) #Printing the data types
print('-'*60)
print(base_airbnb.iloc[0]) #Visualizing only the first row of each column to analyze the content
host_is_superhost object
host_listings_count float64
latitude float64
longitude float64
property_type object
room_type object
accommodates int64
bathrooms float64
bedrooms float64
beds float64
bed_type object
amenities object
price object
guests_included int64
extra_people object
minimum_nights int64
maximum_nights int64
number_of_reviews int64
instant_bookable object
is_business_travel_ready object
cancellation_policy object
year int64
month int64
dtype: object
------------------------------------------------------------
host_is_superhost f
host_listings_count 1.0
latitude -22.946854
longitude -43.182737
property_type Apartment
room_type Entire home/apt
accommodates 4
bathrooms 1.0
bedrooms 0.0
beds 2.0
bed_type Real Bed
amenities {TV,Internet,"Air conditioning",Kitchen,Doorma...
price $133.00
guests_included 2
extra_people $34.00
minimum_nights 60
maximum_nights 365
number_of_reviews 38
instant_bookable f
is_business_travel_ready f
cancellation_policy strict_14_with_grace_period
year 2018
month 4
Name: 0, dtype: object
#price
# regex=False treats '$' as a literal character rather than a regex anchor
base_airbnb['price'] = base_airbnb['price'].str.replace('$', '', regex=False)
base_airbnb['price'] = base_airbnb['price'].str.replace(',', '', regex=False)
base_airbnb['price'] = base_airbnb['price'].astype(np.float32, copy=False)
#extra_people
base_airbnb['extra_people'] = base_airbnb['extra_people'].str.replace('$', '', regex=False)
base_airbnb['extra_people'] = base_airbnb['extra_people'].str.replace(',', '', regex=False)
base_airbnb['extra_people'] = base_airbnb['extra_people'].astype(np.float32, copy=False)
#verifying the updated data
print(base_airbnb.dtypes)
host_is_superhost object
host_listings_count float64
latitude float64
longitude float64
property_type object
room_type object
accommodates int64
bathrooms float64
bedrooms float64
beds float64
bed_type object
amenities object
price float32
guests_included int64
extra_people float32
minimum_nights int64
maximum_nights int64
number_of_reviews int64
instant_bookable object
is_business_travel_ready object
cancellation_policy object
year int64
month int64
dtype: object
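Because `price` and `extra_people` receive the same treatment, the conversion can be factored into one helper. This is a sketch under the assumption that every value looks like `$1,234.00`; `currency_to_float` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

def currency_to_float(series):
    """Convert strings such as '$1,234.00' to float32 values."""
    cleaned = (
        series.str.replace('$', '', regex=False)   # literal dollar sign
              .str.replace(',', '', regex=False)   # thousands separator
    )
    return cleaned.astype('float32')

prices = pd.Series(['$133.00', '$1,250.00'])
converted = currency_to_float(prices)
print(converted.tolist())  # [133.0, 1250.0]
```

Passing `regex=False` keeps the behavior the same across pandas versions, since `$` is otherwise a regex metacharacter.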
Each feature needs to be examined individually, in the following order:
First, the price column (the target we want to predict) and extra_people (also a monetary value). These are continuous numerical values.
Then, the columns with discrete numerical values (accommodates, bedrooms, guests_included, etc.).
Finally, the text columns, to determine which categories make sense to keep or discard.
plt.figure(figsize=(15, 10))
# numeric_only=True restricts the correlation to numeric columns
sns.heatmap(base_airbnb.corr(numeric_only=True), annot=True, cmap='Blues')
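The heatmap is drawn from the pairwise Pearson correlation matrix of the numeric columns. The toy example below shows what that matrix looks like; the values are made up, and the `numeric_only` argument assumes pandas 1.5 or later.

```python
import pandas as pd

toy = pd.DataFrame({
    'accommodates': [2, 4, 6, 8],
    'price': [100.0, 200.0, 300.0, 400.0],  # perfectly linear in accommodates
    'room_type': ['Private room', 'Entire home/apt',
                  'Entire home/apt', 'Entire home/apt'],
})

# Text columns such as room_type are skipped; only numeric columns are correlated.
corr = toy.corr(numeric_only=True)
print(corr)
```

A value close to 1 or -1 between two features indicates that they carry nearly the same information, which is one reason to consider keeping only one of them.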
def fences(column):
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1 #iqr = interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return lower_fence, upper_fence

def remove_outliers(df, column_name):
    qty_rows = df.shape[0]
    lower_fence, upper_fence = fences(df[column_name])
    #Keep only the rows inside the fence range
    df = df.loc[(df[column_name] >= lower_fence) & (df[column_name] <= upper_fence), :]
    removed_rows = qty_rows - df.shape[0]
    return df, removed_rows

def boxplot(column):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(15, 5) #Defining the size of boxplots
    sns.boxplot(x=column, ax=ax1)
    ax2.set_xlim(fences(column)) #The second boxplot will show only the fence range, without showing the outliers.
    sns.boxplot(x=column, ax=ax2)

def histogram(column):
    plt.figure(figsize=(15, 5))
    #histplot replaces the deprecated distplot; kde=True keeps the density curve
    sns.histplot(column, kde=True, stat='density')

def bar_chart(column):
    plt.figure(figsize=(15, 5))
    ax = sns.barplot(x=column.value_counts().index, y=column.value_counts())
    ax.set_xlim(fences(column))
boxplot(base_airbnb['price'])
histogram(base_airbnb['price'])
I'm building a model for regular residential properties, and I believe that values above the upper fence represent only extremely luxurious apartments, which are not my main focus. Therefore, I will exclude these outliers.
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'price')
print('{} rows were removed'.format(removed_rows))
87282 rows were removed
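To make the fence rule concrete, here is a small worked example with made-up prices. It repeats the same IQR logic on a standalone Series; the helper name `iqr_fences` is only for illustration.

```python
import pandas as pd

def iqr_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Seven ordinary prices and one extreme value (hypothetical data).
prices = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 200.0, 5000.0])

lower, upper = iqr_fences(prices)   # Q1 = 127.5, Q3 = 185.0, IQR = 57.5
kept = prices[(prices >= lower) & (prices <= upper)]
print(lower, upper, len(kept))      # 41.25 271.25 7
```

Only the 5000.0 listing falls outside the fences. Because prices cannot be negative, in practice it is the upper fence that removes rows.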
histogram(base_airbnb['price'])
print(base_airbnb.shape)
(810427, 23)
boxplot(base_airbnb['extra_people'])
histogram(base_airbnb['extra_people'])
I'm removing the outliers from this column too, because the values above the upper fence are extreme compared with the rest of the distribution.
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'extra_people')
print('{} rows were removed'.format(removed_rows))
59194 rows were removed
histogram(base_airbnb['extra_people'])
print(base_airbnb.shape)
(751233, 23)
boxplot(base_airbnb['host_listings_count'])
bar_chart(base_airbnb['host_listings_count'])
We can exclude the outliers because, for the purpose of our project, hosts with more than 6 properties on Airbnb are not the target audience. I imagine they might be real estate investors or professionals managing properties on Airbnb.
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'host_listings_count')
print('{} rows were removed'.format(removed_rows))
97723 rows were removed
boxplot(base_airbnb['accommodates'])
bar_chart(base_airbnb['accommodates'])
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'accommodates')
print('{} rows were removed'.format(removed_rows))
13146 rows were removed
boxplot(base_airbnb['bathrooms'])
plt.figure(figsize=(15, 5))
sns.barplot(x=base_airbnb['bathrooms'].value_counts().index, y=base_airbnb['bathrooms'].value_counts())
<Axes: ylabel='bathrooms'>
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'bathrooms')
print('{} rows were removed'.format(removed_rows))
6894 rows were removed
boxplot(base_airbnb['bedrooms'])
bar_chart(base_airbnb['bedrooms'])
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'bedrooms')
print('{} rows were removed'.format(removed_rows))
5482 rows were removed
boxplot(base_airbnb['beds'])
bar_chart(base_airbnb['beds'])
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'beds')
print('{} rows were removed'.format(removed_rows))
5622 rows were removed
#boxplot(base_airbnb['guests_included'])
#bar_chart(base_airbnb['guests_included'])
print(fences(base_airbnb['guests_included']))
plt.figure(figsize=(15, 5))
sns.barplot(x=base_airbnb['guests_included'].value_counts().index, y=base_airbnb['guests_included'].value_counts())
(1.0, 1.0)
<Axes: ylabel='guests_included'>
I'm removing this feature from the analysis. It appears that Airbnb users frequently keep the default value of 1 guest included, so both fences collapse to 1. This can lead the model to consider a feature that is not actually essential for determining the price. Therefore, it seems better to exclude the column from the analysis.
base_airbnb = base_airbnb.drop('guests_included', axis=1)
base_airbnb.shape
(622366, 22)
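The (1.0, 1.0) fences printed above are what a quartile-based rule produces when most rows share a single value. The sketch below reproduces that effect on made-up data, assuming the usual 1.5 x IQR fences; it is an illustration, not the notebook's own computation.

```python
import pandas as pd

# Toy column: 80% of listings keep the default of 1 guest included
guests = pd.Series([1] * 8 + [2, 4])

q1 = guests.quantile(0.25)
q3 = guests.quantile(0.75)
iqr = q3 - q1  # both quartiles are 1, so the IQR is 0

fence = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
flagged = int(((guests < fence[0]) | (guests > fence[1])).sum())
print(fence, flagged)  # (1.0, 1.0) 2
```

With a zero-width fence, every non-default value would count as an outlier, so dropping the column is a reasonable alternative to dropping rows.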
boxplot(base_airbnb['minimum_nights'])
bar_chart(base_airbnb['minimum_nights'])
Here I have an even stronger reason to exclude these apartments from the analysis.
I'm aiming to build a model that helps price regular apartments as an average person would like to list them. In the case of apartments with a "minimum nights" value greater than 8, they could be seasonal rentals or apartments for long-term living where the host requires a minimum stay of at least one month.
Therefore, let's exclude the outliers from this column.
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'minimum_nights')
print('{} rows were removed'.format(removed_rows))
40383 rows were removed
boxplot(base_airbnb['maximum_nights'])
bar_chart(base_airbnb['maximum_nights'])
This column doesn't seem like it will contribute to the analysis.
That's because it appears that nearly all hosts do not fill in the "maximum nights" field, so it doesn't seem to be a relevant factor.
It's better to exclude this column from the analysis.
base_airbnb = base_airbnb.drop('maximum_nights', axis=1)
base_airbnb.shape
(581983, 21)
boxplot(base_airbnb['number_of_reviews'])
bar_chart(base_airbnb['number_of_reviews'])
print(base_airbnb['property_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='property_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
Apartment                458354
House                     51387
Condominium               26456
Serviced apartment        12671
Loft                      12352
Guest suite                3621
Bed and breakfast          3472
Hostel                     2665
Guesthouse                 2155
Other                      1957
Villa                      1294
Townhouse                   969
Aparthotel                  693
Chalet                      481
Earth house                 468
Tiny house                  457
Boutique hotel              447
Hotel                       376
Casa particular (Cuba)      298
Cottage                     230
Bungalow                    207
Dorm                        185
Cabin                       141
Nature lodge                124
Castle                       80
Treehouse                    76
Island                       54
Boat                         53
Hut                          40
Campsite                     34
Resort                       31
Camper/RV                    24
Yurt                         23
Tent                         18
Tipi                         17
Barn                         15
Farm stay                    13
Pension (South Korea)         9
Dome house                    8
Igloo                         6
In-law                        6
Vacation home                 4
Timeshare                     3
Pousada                       3
Houseboat                     3
Casa particular               2
Plane                         1
Name: property_type, dtype: int64
Here, my action is not to "exclude outliers", but rather to group values that are very small.
I will group all property types with fewer than 2,000 occurrences in the database into a single category called "Others". I believe this will simplify the model.
home_type_table = base_airbnb['property_type'].value_counts()
group_columns = []
for types in home_type_table.index:
    if home_type_table[types] < 2000:
        group_columns.append(types)
print(group_columns)
for types in group_columns:
    base_airbnb.loc[base_airbnb['property_type']==types, 'property_type'] = 'Others'
print(base_airbnb['property_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='property_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
['Other', 'Villa', 'Townhouse', 'Aparthotel', 'Chalet', 'Earth house', 'Tiny house', 'Boutique hotel', 'Hotel', 'Casa particular (Cuba)', 'Cottage', 'Bungalow', 'Dorm', 'Cabin', 'Nature lodge', 'Castle', 'Treehouse', 'Island', 'Boat', 'Hut', 'Campsite', 'Resort', 'Camper/RV', 'Yurt', 'Tent', 'Tipi', 'Barn', 'Farm stay', 'Pension (South Korea)', 'Dome house', 'Igloo', 'In-law', 'Vacation home', 'Timeshare', 'Pousada', 'Houseboat', 'Casa particular', 'Plane']
Apartment             458354
House                  51387
Condominium            26456
Serviced apartment     12671
Loft                   12352
Others                  8850
Guest suite             3621
Bed and breakfast       3472
Hostel                  2665
Guesthouse              2155
Name: property_type, dtype: int64
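The same grouping of rare categories can also be written without explicit loops. The sketch below is one possible vectorized variant on toy data; group_rare is a hypothetical helper name, not something defined in the notebook.

```python
import pandas as pd

def group_rare(series, min_count, other_label="Others"):
    """Replace every category seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps values for which the condition is True and replaces the rest
    return series.where(~series.isin(rare), other_label)

# Toy property types (made-up data)
toy = pd.Series(["Apartment"] * 5 + ["House"] * 3 + ["Boat", "Igloo"])
grouped = group_rare(toy, min_count=3)
print(grouped.value_counts())
```

The same helper could be reused for bed_type with a different threshold, since only the column and the cutoff change.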
print(base_airbnb['room_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='room_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
Entire home/apt    372443
Private room       196859
Shared room         11714
Hotel room            967
Name: room_type, dtype: int64
print(base_airbnb['bed_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='bed_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
Real Bed         570643
Pull-out Sofa      8055
Futon              1634
Airbed             1155
Couch               496
Name: bed_type, dtype: int64
# grouping categories of bed_type
bed_table = base_airbnb['bed_type'].value_counts()
group_columns = []
for types in bed_table.index:
    if bed_table[types] < 10000:
        group_columns.append(types)
print(group_columns)
for types in group_columns:
    base_airbnb.loc[base_airbnb['bed_type']==types, 'bed_type'] = 'Others'
print(base_airbnb['bed_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='bed_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
['Pull-out Sofa', 'Futon', 'Airbed', 'Couch']
Real Bed    570643
Others       11340
Name: bed_type, dtype: int64
# grouping categories of cancellation_policy
cancellation_table = base_airbnb['cancellation_policy'].value_counts()
group_columns = []
for types in cancellation_table.index:
    if cancellation_table[types] < 10000:
        group_columns.append(types)
print(group_columns)
for types in group_columns:
    base_airbnb.loc[base_airbnb['cancellation_policy']==types, 'cancellation_policy'] = 'strict'
print(base_airbnb['cancellation_policy'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='cancellation_policy', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
['strict', 'super_strict_60', 'super_strict_30']
flexible                       258096
strict_14_with_grace_period    200743
moderate                       113281
strict                           9863
Name: cancellation_policy, dtype: int64
Since we have a wide variety of amenities, and the same amenity can be written in different ways, I will use the number of amenities as the feature for the model.
base_airbnb.shape
(581983, 21)
base_airbnb['n_amenities'] = base_airbnb['amenities'].str.split(',').apply(len)
base_airbnb = base_airbnb.drop('amenities', axis=1)
base_airbnb.shape
(581983, 21)
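To show what the comma-based count does, here is a small sketch on made-up strings. The brace-and-comma format of the amenities field is an assumption about the raw data, and the n_amenities_strict column is a hypothetical variant, included only to show that an empty set counts as 1 under a plain split.

```python
import pandas as pd

# Toy amenities strings in an assumed '{a,b,c}' format (not taken from the dataset)
toy = pd.DataFrame({
    "amenities": ['{TV,Wifi,"Air conditioning",Kitchen}', "{Wifi}", "{}"]
})

# Same approach as the notebook: count the pieces produced by splitting on commas
toy["n_amenities"] = toy["amenities"].str.split(",").apply(len)

# Hypothetical variant: strip the braces first and count an empty set as 0
stripped = toy["amenities"].str.strip("{}")
toy["n_amenities_strict"] = stripped.str.split(",").apply(len).where(stripped != "", 0)

print(toy[["n_amenities", "n_amenities_strict"]])
```

The two counts differ only for listings with no amenities, so the choice matters little unless such rows are common.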
Now we can analyze the column n_amenities just like the way the other numerical columns were analyzed:
boxplot(base_airbnb['n_amenities'])
bar_chart(base_airbnb['n_amenities'])
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'n_amenities')
print('{} rows were removed'.format(removed_rows))
24343 rows were removed
I'm now creating a map that displays a random subset of our database (50,000 properties) to see how the properties are distributed throughout the city and also identify areas with higher prices.
sample = base_airbnb.sample(n=50000)
map_center = {'lat':sample.latitude.mean(), 'lon':sample.longitude.mean()}
map_chart = px.density_mapbox(sample, lat='latitude', lon='longitude',z='price', radius=2.5,
center=map_center, zoom=10,
mapbox_style='stamen-terrain')
map_chart.show()
I will now adjust the features to facilitate the work of the future model.
For True or False values, I will replace True with 1 and False with 0.
For categorical features (features where the column values are texts), I will use the method of encoding variables as dummies.
#Replacing true with 1 and false with 0
tf_columns = ['host_is_superhost', 'instant_bookable', 'is_business_travel_ready']
base_airbnb_cod = base_airbnb.copy()
for column in tf_columns:
    base_airbnb_cod.loc[base_airbnb_cod[column]=='t', column] = 1
    base_airbnb_cod.loc[base_airbnb_cod[column]=='f', column] = 0
print(base_airbnb_cod.iloc[0])
host_is_superhost           1
host_listings_count         2.0
latitude                    -22.965919
longitude                   -43.178962
property_type               Condominium
room_type                   Entire home/apt
accommodates                5
bathrooms                   1.0
bedrooms                    2.0
beds                        2.0
bed_type                    Real Bed
price                       270.0
extra_people                51.0
minimum_nights              4
number_of_reviews           205
instant_bookable            1
is_business_travel_ready    0
cancellation_policy         strict
year                        2018
month                       4
n_amenities                 25
Name: 1, dtype: object
#Method of encoding variables as dummies
columns_categories = ['property_type', 'room_type', 'bed_type', 'cancellation_policy']
base_airbnb_cod = pd.get_dummies(data=base_airbnb_cod, columns=columns_categories)
display(base_airbnb_cod.head())
| | host_is_superhost | host_listings_count | latitude | longitude | accommodates | bathrooms | bedrooms | beds | price | extra_people | ... | room_type_Entire home/apt | room_type_Hotel room | room_type_Private room | room_type_Shared room | bed_type_Others | bed_type_Real Bed | cancellation_policy_flexible | cancellation_policy_moderate | cancellation_policy_strict | cancellation_policy_strict_14_with_grace_period |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2.0 | -22.965919 | -43.178962 | 5 | 1.0 | 2.0 | 2.0 | 270.0 | 51.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 3.0 | -22.977117 | -43.190454 | 3 | 1.0 | 1.0 | 2.0 | 161.0 | 45.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 1 | 1.0 | -22.983024 | -43.214270 | 3 | 1.0 | 1.0 | 2.0 | 222.0 | 68.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5 | 1 | 1.0 | -22.988165 | -43.193588 | 3 | 1.5 | 1.0 | 2.0 | 308.0 | 86.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 1.0 | -22.981269 | -43.190457 | 2 | 1.0 | 1.0 | 2.0 | 219.0 | 80.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
5 rows × 37 columns
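The two encoding steps above can be tried on a tiny frame. This is a sketch on made-up rows; using map for the t/f columns is a swapped-in alternative to the loc assignments (it yields an integer column directly, whereas assigning into a text column may leave it with object dtype), and dtype=int is passed so the dummy columns print as 0/1 across pandas versions.

```python
import pandas as pd

# Toy listings (made-up values, same column names as the notebook)
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", "t"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "price": [270.0, 161.0, 222.0],
})

encoded = toy.copy()
# 't'/'f' strings become 1/0 integers
encoded["host_is_superhost"] = encoded["host_is_superhost"].map({"t": 1, "f": 0})
# Each category of room_type becomes its own 0/1 column
encoded = pd.get_dummies(encoded, columns=["room_type"], dtype=int)

print(encoded.columns.tolist())
```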
Here I will use the R² metric, which tells how well the model can explain the price. This will be a great parameter to assess the model quality.
-> The closer to 100%, the better.
I will also calculate the Root-Mean-Squared Error (RMSE), which will show us how much the model is deviating from the actual values.
-> The smaller the error, the better
def evaluate_model(model_name, y_test, prediction):
    r2 = r2_score(y_test, prediction)
    RMSE = np.sqrt(mean_squared_error(y_test, prediction))
    return f'Model {model_name}:\nR²:{r2:.2%}\nRMSE:{RMSE:.2f}'
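For intuition about the two metrics, the sketch below computes them by hand with NumPy on four made-up prices, using the standard definitions: R² = 1 - SS_res / SS_tot and RMSE = the square root of the mean squared error. The variable names are illustrative only.

```python
import numpy as np

# Toy actual and predicted prices (made-up values)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error left unexplained: 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean: 50000
r2 = 1 - ss_res / ss_tot

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R²:{r2:.2%} RMSE:{rmse:.2f}")  # R²:99.20% RMSE:10.00
```

RMSE is in the same unit as the target, so here every prediction being off by 10 gives an RMSE of exactly 10.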
These are some of the models available for predicting numerical values (regression). Since the goal is to predict the price, which is a numerical value, I have chosen these three models.
rf_model = RandomForestRegressor()
lr_model = LinearRegression()
et_model = ExtraTreesRegressor()
models = {'RandomForest': rf_model,
'LinearRegression': lr_model,
'ExtraTrees': et_model,
}
y = base_airbnb_cod['price']
X = base_airbnb_cod.drop('price', axis=1)
This step is crucial. Artificial Intelligence learns from training.
Basically, what I do is separate the data into training and testing sets (usually, the training set is larger). For example, one could allocate 10% of the dataset for testing and 90% for training. The call below does not set test_size, so scikit-learn's default applies: 25% for testing and 75% for training.
Next, the training data is given to the model, allowing it to analyze that data and learn how to predict prices.
Once the model has learned, its performance is evaluated on the testing data. Comparing the results on the testing data makes it possible to determine the best model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
for model_name, model in models.items():
    #training
    model.fit(X_train, y_train)
    #testing
    prediction = model.predict(X_test)
    print(evaluate_model(model_name, y_test, prediction))
Model RandomForest:
R²:96.87%
RMSE:46.88
Model LinearRegression:
R²:33.26%
RMSE:216.61
Model ExtraTrees:
R²:97.41%
RMSE:42.68
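The split, fit, and compare loop can be reproduced on synthetic data. The sketch below uses a made-up nonlinear target, so the scores say nothing about the Airbnb results; it only shows the default 75/25 split and how a tree ensemble can outscore a linear model when the relationship is not linear. All names ending in _toy, _tr, and _te are hypothetical.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a deliberately nonlinear target (made-up data)
rng = np.random.default_rng(10)
X_toy = rng.uniform(-3, 3, size=(400, 2))
y_toy = np.sin(X_toy[:, 0]) * 100 + X_toy[:, 1] ** 2 * 20

# No test_size given, so 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=10)
print(len(X_tr), len(X_te))  # 300 100

toy_models = {
    "LinearRegression": LinearRegression(),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=10),
}
scores = {}
for name, model in toy_models.items():
    model.fit(X_tr, y_tr)                               # training
    scores[name] = r2_score(y_te, model.predict(X_te))  # testing
print(scores)
```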
Model Chosen as Best Model: ExtraTreesRegressor
This was the model with the highest R² value and at the same time the lowest RMSE value. Since we did not have a significant difference in training and prediction speed between this model and the RandomForest model (which had similar R² and RMSE results), we will choose the ExtraTrees Model.
The linear regression model did not yield satisfactory results, with R² and RMSE values much worse than the other 2 models.
#print(et_model.feature_importances_)
#print(X_train.columns)
importance_features = pd.DataFrame(et_model.feature_importances_,X_train.columns)
importance_features = importance_features.sort_values(by = 0, ascending = False)
display(importance_features)
plt.figure(figsize=(15, 5))
ax = sns.barplot(x=importance_features.index, y=importance_features[0])
ax.tick_params(axis = 'x', rotation = 90)
| | 0 |
|---|---|
| bedrooms | 0.124351 |
| latitude | 0.092095 |
| longitude | 0.086285 |
| extra_people | 0.082756 |
| n_amenities | 0.074254 |
| bathrooms | 0.067721 |
| number_of_reviews | 0.067232 |
| room_type_Entire home/apt | 0.064895 |
| accommodates | 0.062930 |
| minimum_nights | 0.062276 |
| beds | 0.046582 |
| host_listings_count | 0.036521 |
| instant_bookable | 0.021183 |
| cancellation_policy_flexible | 0.018356 |
| property_type_Apartment | 0.012602 |
| cancellation_policy_moderate | 0.011831 |
| host_is_superhost | 0.010735 |
| year | 0.010714 |
| cancellation_policy_strict_14_with_grace_period | 0.008723 |
| property_type_House | 0.006926 |
| property_type_Condominium | 0.004862 |
| month | 0.004443 |
| room_type_Private room | 0.003599 |
| bed_type_Real Bed | 0.002536 |
| bed_type_Others | 0.002490 |
| property_type_Others | 0.002241 |
| property_type_Loft | 0.002232 |
| property_type_Serviced apartment | 0.002136 |
| room_type_Shared room | 0.001870 |
| property_type_Bed and breakfast | 0.001268 |
| property_type_Guesthouse | 0.000893 |
| cancellation_policy_strict | 0.000878 |
| property_type_Guest suite | 0.000625 |
| property_type_Hostel | 0.000618 |
| room_type_Hotel room | 0.000341 |
| is_business_travel_ready | 0.000000 |
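A compact way to build the same kind of ranking is a named pandas Series. The sketch below uses synthetic data with hypothetical columns; it also shows one way an importance of exactly zero can arise, namely a column that holds a single value and therefore can never be used in a split. Whether that is the cause for is_business_travel_ready is not confirmed by the output above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: one informative column, one noise column, one constant column
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=300),
    "noise": rng.normal(size=300),
    "constant_flag": np.zeros(300),
})
y_toy = 100 * X_toy["bedrooms"] + rng.normal(scale=5, size=300)

model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Index by column name so each importance is labeled, then rank
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```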
base_airbnb_cod = base_airbnb_cod.drop('is_business_travel_ready', axis=1)
y = base_airbnb_cod['price']
X = base_airbnb_cod.drop('price', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
et_model.fit(X_train, y_train)
prediction = et_model.predict(X_test)
print(evaluate_model('ExtraTrees', y_test, prediction))
Model ExtraTrees: R²:97.41% RMSE:42.70
Removing is_business_travel_ready has almost no impact on the model, so it's better to leave this feature out so the model can run faster.
Before:
R²: 97.41%
RMSE: 42.68
Now:
Model ExtraTrees:
R²:97.41%
RMSE:42.70
test_base = base_airbnb_cod.copy()
for column in test_base:
    if 'bed_type' in column:
        test_base = test_base.drop(column, axis = 1)
print(test_base.columns)
y = test_base['price']
X = test_base.drop('price', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
et_model.fit(X_train, y_train)
prediction = et_model.predict(X_test)
print(evaluate_model('ExtraTrees', y_test, prediction))
Index(['host_is_superhost', 'host_listings_count', 'latitude', 'longitude',
'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price',
'extra_people', 'minimum_nights', 'number_of_reviews',
'instant_bookable', 'year', 'month', 'n_amenities',
'property_type_Apartment', 'property_type_Bed and breakfast',
'property_type_Condominium', 'property_type_Guest suite',
'property_type_Guesthouse', 'property_type_Hostel',
'property_type_House', 'property_type_Loft', 'property_type_Others',
'property_type_Serviced apartment', 'room_type_Entire home/apt',
'room_type_Hotel room', 'room_type_Private room',
'room_type_Shared room', 'cancellation_policy_flexible',
'cancellation_policy_moderate', 'cancellation_policy_strict',
'cancellation_policy_strict_14_with_grace_period'],
dtype='object')
Model ExtraTrees:
R²:97.39%
RMSE:42.86
Just like in the previous analysis, there were hardly any changes to the model, and yet it was possible to remove more features to make the process faster. Therefore, I will keep the model as it is.
Before:
R²: 97.41%
RMSE: 42.70
Now:
Model ExtraTrees:
R²:97.39%
RMSE:42.86
X['price'] = y
X.to_csv('data.csv')
import joblib
joblib.dump(et_model, 'model.joblib')
['model.joblib']
By using joblib, it was possible to create a file that contains the entire trained model, so it does not need to be retrained each time it is used. This is essential for building the website that will predict property prices.
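Loading the saved file back is the other half of this step. The sketch below does a dump and load round trip with a small model trained on made-up data; the file name toy_model.joblib and the temporary directory are assumptions for illustration, and the real model.joblib would be loaded the same way with joblib.load.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Small model on synthetic data (stand-in for the trained et_model)
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(50, 3))
y_toy = X_toy @ np.array([100.0, 50.0, 10.0])
model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(model, path)       # save the fitted model to disk
    restored = joblib.load(path)   # load it back without retraining

# The restored model should give the same predictions as the original
same = bool(np.allclose(model.predict(X_toy), restored.predict(X_toy)))
print(same)
```

The columns passed to predict must match the training columns in number and order, which is why the notebook also saves the final feature table to data.csv.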